Abstract

We have recently developed analysis methods (GREML) to estimate the genetic variance of a complex trait/disease and the 
genetic correlation between two complex traits/diseases using genome-wide single nucleotide polymorphism (SNP) data in 
unrelated individuals. Here we use analytical derivations and simulations to quantify the sampling variance of the estimate 
of the proportion of phenotypic variance captured by all SNPs for quantitative traits and case-control studies. We also derive 
the approximate sampling variance of the estimate of a genetic correlation in a bivariate analysis, when two complex traits 
are measured either on the same or on different individuals. We show that the sampling variance is inversely proportional to
the number of pairwise contrasts in the analysis and to the variance in SNP-derived genetic relationships. For bivariate 
analysis, the sampling variance of the genetic correlation additionally depends on the harmonic mean of the proportion of 
variance explained by the SNPs for the two traits and the genetic correlation between the traits, and depends on the 
phenotypic correlation when the traits are measured on the same individuals. We provide an online tool for calculating the 
power of detecting genetic (co)variation using genome-wide SNP data. The new theory and online tool will be helpful to 
plan experimental designs to estimate the missing heritability that has not yet been fully revealed through genome-wide 
association studies, and to estimate the genetic overlap between complex traits (diseases) in particular when the traits 
(diseases) are not measured on the same samples. 
Introduction

Genome-wide association studies (GWAS) have been extremely
successful in identifying genetic variants associated with complex
traits and diseases in humans [1]. In GWAS, hundreds of
thousands or millions of SNPs are tested one by one for statistical
evidence of association with a trait, and to avoid false positive
discoveries due to the very large number of statistical tests being
conducted, usually a very stringent p-value threshold, e.g. $5 \times 10^{-8}$,
is used to report a significant finding. Therefore, if there are many
genes each with a small effect affecting the trait, most of these
genetic variants will fail to pass the stringent threshold and remain
undetected. This is one of the explanations of the 'missing
heritability' question, namely that the genetic variants identified from GWAS
so far explain only a fraction of the heritability for complex traits [2].
We proposed a method, which is able to estimate the total amount 
of variance explained by all SNPs together without testing the 
SNPs individually for a quantitative trait [3], and subsequently 
extended it to the estimation of missing heritability for binary 
disease data from ascertained case-control studies [4]. The 
analyses until recently only included common SNPs (e.g. minor 
allele frequency >0.01). The estimate quantifies the overall
contribution from the additive effects of all SNPs, which is the
upper limit of the proportion of variance that is captured by the 
additive effects of the set of SNPs used in the estimation, and is also 
the lower limit of the narrow-sense heritability of the trait. We also 
extended the method to estimate the genetic correlation between 
two traits using SNP data [5,6]. In contrast to the traditional 
(co)variance estimation methods that rely on pedigree information 
(family/ twin studies), our method uses unrelated samples from a 
general population and the genetic (co)variance is estimated using 
a genetic relationship matrix (GRM) estimated from SNPs. The 
estimate of genetic variance using SNP data in unrelated 
individuals is free of confounding from common environment 
effects shared between close relatives that are difficult to model in 
family-based analyses, and is directly comparable to results from 
GWAS, because both are based on the same experimental design. 
For multiple trait analysis, the SNP-based approach allows the 
estimation of the genetic correlation between complex traits 
measured on different samples [6,7]. This is important
in particular for estimating the genetic correlation between 
diseases because multiple diseases are unlikely to co-segregate in 
sufficiently large pedigrees to allow estimation using traditional 
pedigree design. The SNP-based method has the flexibility of
estimating the genetic correlation between any two diseases using
completely independent case-control data. Other methods to
estimate genetic parameters from individual-level or summary
GWAS data have also been reported [8-10].

Author Summary

Genome-wide association studies (GWAS) have identified
thousands of genetic variants for hundreds of traits and
diseases. However, the genetic variants discovered from
GWAS explain only a small fraction of the heritability,
resulting in the question of "missing heritability". We have
recently developed approaches (called GREML) to estimate
the overall contribution of all SNPs to the phenotypic
variance of a trait (disease) and the proportion of genetic
overlap between traits (diseases). A frequently asked
question is how many samples are required to
estimate the proportion of variance attributable to all
SNPs and the proportion of genetic overlap with useful
precision. In this study, we derive the standard errors of
the estimated parameters from theory and find that they
are highly consistent with the values observed in
published results and those obtained from simulation. The
theory together with an online application tool will be
helpful to plan experimental design to quantify the
missing heritability, and to estimate the genetic overlap
between traits (diseases) especially when it is infeasible to
have the traits (diseases) measured on the same individuals.

We previously named the SNP-based method described above
GREML [11], as a complement to GBLUP [12] in which variance
components are assumed known, and have implemented
it in the software tool GCTA [13]. One outstanding question is
the statistical power of detecting genetic variation using the
population-based estimation method, for example how many
samples are required to achieve estimates that are sufficiently
accurate to detect genetic (co)variance of complex traits. In this
paper, we derive the sampling variance of the estimate of genetic 
(co)variance by analytical derivations and verify our derivations by 
simulations under a range of scenarios. We also provide an online 
tool for power calculation. 

Methods and Results 

Univariate analysis 

The methods of using SNP data to estimate genetic variance in 
unrelated individuals have been detailed elsewhere [3,13]. In brief, 
given GWAS data, we can model the phenotype as 

$$\mathbf{y} = \mathbf{g} + \mathbf{e} \tag{1}$$

where $\mathbf{y}$ is an $N \times 1$ vector of phenotypes with $N$ being the sample
size, $\mathbf{g}$ is an $N \times 1$ vector with each of its elements being the total
genetic effect of an individual captured by all SNPs, and $\mathbf{e}$ is an
$N \times 1$ vector of residuals. We have $\mathbf{g} \sim N(\mathbf{0}, \sigma_G^2 \mathbf{A})$ and $\mathbf{e} \sim N(\mathbf{0}, \sigma_e^2 \mathbf{I})$,
where $\sigma_G^2$ is the genetic variance captured by all SNPs, $\mathbf{A}$ is the
genetic relationship matrix (GRM) estimated from SNPs [3], $\sigma_e^2$ is
the residual variance and $\mathbf{I}$ is an identity matrix. The genetic
relationships, also known as 'genomic relationships' or 'genetic
similarity relationships', are referenced to the current population,
and so can be negative as they are distributed about a mean of
zero. Equation (1) is a typical mixed linear model with
$\mathrm{var}(\mathbf{y}) = \sigma_G^2 \mathbf{A} + \sigma_e^2 \mathbf{I}$, in which the variance components can be
estimated using a restricted maximum likelihood (REML)
approach [13,14]. The proportion of variance explained by all
SNPs (SNP heritability) is defined as $h_G^2 = \sigma_G^2 / (\sigma_G^2 + \sigma_e^2)$.

For power calculation, we need to know the sampling variance
of the estimate of $\sigma_G^2$, i.e. $\mathrm{var}(\hat\sigma_G^2)$. In practice, the asymptotic
sampling variance (standard error squared) of a variance
component is calculated from a diagonal element of the inverse
of the information matrix in maximum likelihood analysis [15-18].
Each element of the information matrix, however, comprises
complex forms of matrix algebra including a matrix inverse. It is
therefore infeasible to derive $\mathrm{var}(\hat\sigma_G^2)$ directly from the inverse of
the information matrix. We show below an equivalent approach to
obtain $\mathrm{var}(\hat\sigma_G^2)$ under the simple regression framework.

For unrelated individuals, where the phenotypic correlation 
between individuals is small, mixed linear model analysis using the 
REML approach is asymptotically equivalent to simple regression 
analysis of pairwise phenotypic similarity/difference on pairwise 
genetic similarity, as measured by identity-by-descent (IBD) or 
identity-by-state (IBS) at genome-wide markers [17-20]. Under
such circumstances, a regression of the cross-product of the
phenotypes is equivalent to using both the squared difference 
and squared sum of the pairwise phenotypes, and using the cross- 
product is equivalent to using maximum likelihood [19]. The 
model for the regression-based analysis can be written as 

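$$z_{ij} = \mu + b A_{ij} + \varepsilon_{ij} \tag{2}$$

where $z_{ij} = y_i y_j$ with $y_i$ and $y_j$ being the phenotypes of individuals $i$
and $j$ ($i > j$), $A_{ij}$ is the $ij$-th element of the GRM $\mathbf{A}$, and $\varepsilon_{ij}$ is the
residual of this regression. There are $n = N(N-1)/2 \approx N^2/2$
observations (contrasts) in the regression. The regression coefficient
$b$ is equivalent to $\sigma_G^2$ because

$$b = \frac{\mathrm{cov}(A_{ij}, y_i y_j)}{\mathrm{var}(A_{ij})} = \frac{\mathrm{cov}(A_{ij}, g_i g_j)}{\mathrm{var}(A_{ij})} = \frac{E(A_{ij} g_i g_j)}{\mathrm{var}(A_{ij})} = \frac{\sigma_G^2 E(A_{ij}^2)}{\mathrm{var}(A_{ij})} = \sigma_G^2$$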
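As a minimal illustration of the regression in equation (2), the sketch below computes the least-squares slope of the pairwise phenotype cross-products on the off-diagonal GRM elements. It is written in plain Python for clarity rather than speed; the function name `cross_product_regression_slope` and the nested-list representation of the GRM are assumptions made for this example and are not part of GCTA.

```python
from itertools import combinations


def cross_product_regression_slope(y, grm):
    """Slope b from regressing z_ij = y_i * y_j on A_ij over all pairs i != j.

    y   : list of standardized phenotypes (mean 0, variance 1 is assumed)
    grm : square nested list; only the off-diagonal elements are used
    """
    pairs = list(combinations(range(len(y)), 2))  # N(N-1)/2 contrasts
    a = [grm[i][j] for i, j in pairs]
    z = [y[i] * y[j] for i, j in pairs]
    n = len(pairs)
    a_bar = sum(a) / n
    z_bar = sum(z) / n
    cov_az = sum((ai - a_bar) * (zi - z_bar) for ai, zi in zip(a, z)) / n
    var_a = sum((ai - a_bar) ** 2 for ai in a) / n
    return cov_az / var_a  # estimate of the genetic variance
```

With standardized phenotypes the returned slope plays the role of the estimate of $\sigma_G^2$; a REML analysis would normally be used in practice, and this sketch only mirrors the regression argument in the text.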

In such a simple regression, the sampling variance of the estimate
of the regression coefficient is

$$\mathrm{var}(\hat\sigma_G^2) = \mathrm{var}(\hat b) = \frac{\mathrm{var}(\varepsilon_{ij})}{n\,\mathrm{var}(A_{ij})} \tag{3}$$

If the samples are unrelated and the phenotypes have been
standardized with mean of 0 and variance of 1, then $E(z_{ij}) = 0$ and
$\mathrm{var}(z_{ij}) = 1$. Since $\mathrm{var}(A_{ij})$ is small, there is hardly any variance in
$z_{ij}$ that can be explained by $A_{ij}$ so that $\mathrm{var}(\varepsilon_{ij}) \approx \mathrm{var}(z_{ij}) = 1$. We
therefore have

$$\mathrm{var}(\hat\sigma_G^2) \approx \frac{2}{N^2\,\mathrm{var}(A_{ij})} \tag{4}$$

Under circumstances when $\mathrm{var}(A_{ij})$ is large, for example when the
GRM is calculated from pedigree data, a substantial proportion of
variance in $z_{ij}$ could be explained by $A_{ij}$, so that $\mathrm{var}(\varepsilon_{ij})$ will be
smaller than $\mathrm{var}(z_{ij})$ and the sampling variance of the estimate of
genetic variance will be reduced accordingly. In general, $\mathrm{var}(A_{ij})$
and the residual variance in equation (2) depend on the number of
SNPs that are used to calculate the GRM and their correlation
structure. Although $\mathrm{var}(A_{ij})$ can be calculated empirically from
the data, theoretical work suggests it is approximately $2 \times 10^{-5}$ for
genome-wide coverage of common SNPs in human populations
[21]. Since the phenotypic variance is usually estimated with very
high precision,

$$\mathrm{var}(\hat h_G^2) \approx \mathrm{var}(\hat\sigma_G^2) \approx \frac{2}{N^2\,\mathrm{var}(A_{ij})} = \frac{10^5}{N^2} \tag{5}$$

This suggests that the standard error (SE) of $\hat h_G^2$ depends only on
sample size, and is approximately $316/N$. We show by simulations
based on real genotype data (Text S1) that this approximation is
very accurate (Figure 1 and Table S1). The SEs calculated from
the approximation theory are also highly consistent with those
reported from our previous studies for human height and body
mass index (BMI). For example, the reported SE of $\hat h_G^2$ for height
was 0.083 using 3925 unrelated samples [3] and 0.029 for both
height and BMI, irrespective of $h_G^2$, using 11586 unrelated samples
[22], and the SE calculated from the approximation theory is
0.081 for $N = 3925$ and 0.027 for $N = 11586$.
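The arithmetic behind these two predicted values can be reproduced with a few lines of Python. This is only a sketch of equation (5); the function name `se_h2_snp` is hypothetical, and the default of $2 \times 10^{-5}$ for the variance of the off-diagonal GRM elements is the value quoted in the text for common SNPs in unrelated individuals.

```python
import math

VAR_AIJ = 2e-5  # var(A_ij) quoted in the text for common SNPs, unrelated humans


def se_h2_snp(n, var_aij=VAR_AIJ):
    """Approximate SE of the SNP-heritability estimate: sqrt(2 / (N^2 var(A_ij)))."""
    return math.sqrt(2.0 / (n ** 2 * var_aij))


print(round(se_h2_snp(3925), 3))   # 0.081
print(round(se_h2_snp(11586), 3))  # 0.027
```

A different value of `var_aij` should be supplied when the GRM is built from another SNP set or from a sample containing relatives.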

Bivariate analysis (traits measured on the same 
individuals) 

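For a bivariate analysis where the two traits are measured on the
same individuals, the mixed linear model can be written as [6]

$$\mathbf{y}_1 = \mathbf{g}_1 + \mathbf{e}_1 \text{ for trait \#1 and } \mathbf{y}_2 = \mathbf{g}_2 + \mathbf{e}_2 \text{ for trait \#2} \tag{6}$$

where $\mathbf{y}_1$ and $\mathbf{y}_2$ are $N \times 1$ vectors of phenotypes, $\mathbf{g}_1$ and $\mathbf{g}_2$ are $N \times 1$
vectors of genetic effects with $\mathbf{g}_1 \sim N(\mathbf{0}, \sigma_{G1}^2 \mathbf{A})$ and
$\mathbf{g}_2 \sim N(\mathbf{0}, \sigma_{G2}^2 \mathbf{A})$, $\mathbf{e}_1$ and $\mathbf{e}_2$ are $N \times 1$ vectors of residuals with
$\mathbf{e}_1 \sim N(\mathbf{0}, \sigma_{e1}^2 \mathbf{I})$ and $\mathbf{e}_2 \sim N(\mathbf{0}, \sigma_{e2}^2 \mathbf{I})$, and $N$ is the sample size. The
variance covariance matrix is

$$\mathrm{var}\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \sigma_{G1}^2 \mathbf{A} + \sigma_{e1}^2 \mathbf{I} & \sigma_{G1G2} \mathbf{A} + \sigma_{e1e2} \mathbf{I} \\ \sigma_{G1G2} \mathbf{A} + \sigma_{e1e2} \mathbf{I} & \sigma_{G2}^2 \mathbf{A} + \sigma_{e2}^2 \mathbf{I} \end{bmatrix}$$

where $\sigma_{G1G2}$ is the genetic covariance between the two traits and
$\sigma_{e1e2}$ is the residual covariance. The genetic variance and
covariance components can also be estimated using REML [6].

The genetic correlation is estimated as $\hat r_G = \hat\sigma_{G1G2} / \sqrt{\hat\sigma_{G1}^2 \hat\sigma_{G2}^2}$.
Since $r_G$ is a non-linear function of $\sigma_{G1G2}$, $\sigma_{G1}^2$ and $\sigma_{G2}^2$, there
is no explicit derivation for $\mathrm{var}(\hat\sigma_{G1G2} / \sqrt{\hat\sigma_{G1}^2 \hat\sigma_{G2}^2})$. Reeve (1955)
and Robertson (1959) provided an approximation of $\mathrm{var}(\hat r_G)$
in the context of balanced pedigree design as

$$\frac{(1 - r_G^2)^2 \sqrt{\mathrm{var}(\hat h_{G1}^2)\,\mathrm{var}(\hat h_{G2}^2)}}{2 h_{G1}^2 h_{G2}^2}$$

[23,24] and Koots and Gibson (1996) proposed a modified approximation

$$\frac{(1 - r_P^2)^2 \sqrt{\mathrm{var}(\hat h_{G1}^2)\,\mathrm{var}(\hat h_{G2}^2)}}{2 h_{G1}^2 h_{G2}^2}$$

[25], where $r_P$ is the phenotypic correlation between the traits.
However, both approximations have an unsatisfying property that
$\mathrm{var}(\hat r_G)$ will approach 0 if $r_G$ or $r_P$ is close to 1. We derived an
approximation, which does not have this problem (Text S2), i.e.

$$\mathrm{var}(\hat r_G) \approx \frac{(1 - r_G r_P)^2 + (r_G - r_P)^2}{h_{G1}^2 h_{G2}^2 N^2\,\mathrm{var}(A_{ij})} \tag{7}$$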



As described above, $\mathrm{var}(A_{ij}) \approx 2 \times 10^{-5}$ for a GRM estimated from
common SNPs in unrelated individuals in human populations,
therefore

$$SE(\hat r_G) \approx \frac{224}{N} \sqrt{\frac{(1 - r_G r_P)^2 + (r_G - r_P)^2}{h_{G1}^2 h_{G2}^2}}$$

When $r_G = r_P = 0$, i.e. the traits are completely independent,
$SE(\hat r_G) \approx 224 / (N \sqrt{h_{G1}^2 h_{G2}^2})$. We tested equation (7) by simulations
based on real genotype data (Text S1). The simulation results
suggest that the approximation is reasonably accurate (Table 1).
For real data analysis, we previously estimated the genetic
correlation between intelligence at age 11 years and in old age
of 0.62 with a SE of 0.23 using 1729 samples [5], consistent with
the predicted SE of 0.22 from the approximation theory.

[Figure 1 image not reproduced. Legend: Observed (heritability = 0.2), Observed (heritability = 0.5), Observed (heritability = 0.8), Approximation theory. X-axis: Sample size (1000 to 5000).]

Figure 1. Standard error of the estimate of variance explained by all SNPs vs. sample size. The first three columns are the averaged
standard error observed from 100 simulations under three heritability levels. The last column is the predicted standard error from our approximation
theory. The plotted data can be found in Table S1.
doi:10.1371/journal.pgen.1004269.g001
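Equation (7) can be evaluated directly once the heritabilities and the two correlations are specified. The following sketch is one way to do so; the function name `se_rg_same_sample` and its argument names are illustrative assumptions, and the default for the variance of the off-diagonal GRM elements is the value quoted in the text.

```python
import math


def se_rg_same_sample(n, h2_1, h2_2, r_g, r_p, var_aij=2e-5):
    """Approximate SE of the genetic correlation for two traits measured
    on the same N individuals (equation 7)."""
    numerator = (1 - r_g * r_p) ** 2 + (r_g - r_p) ** 2
    denominator = h2_1 * h2_2 * n ** 2 * var_aij
    return math.sqrt(numerator / denominator)


# Independent traits (r_G = r_P = 0) reduce to about 224 / (N * sqrt(h2_1 * h2_2)).
print(se_rg_same_sample(1000, 0.5, 0.5, 0.0, 0.0))
```

The special case in the last line is the one given in the text for completely independent traits.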

Bivariate analysis (traits measured on different sets of 
individuals) 

For a bivariate analysis where the two traits are measured on
different sets of individuals, e.g. height in males and blood pressure
in females, the variance-covariance matrix is [6]

$$\mathrm{var}\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \sigma_{G1}^2 \mathbf{A}_1 + \sigma_{e1}^2 \mathbf{I}_1 & \sigma_{G1G2} \mathbf{A}_{12} \\ \sigma_{G1G2} \mathbf{A}_{12}^{\top} & \sigma_{G2}^2 \mathbf{A}_2 + \sigma_{e2}^2 \mathbf{I}_2 \end{bmatrix}$$

where $\mathbf{y}_1$ is an $N_1 \times 1$ vector of phenotypes in sample set #1 (e.g.
males), and $\mathbf{y}_2$ is an $N_2 \times 1$ vector of phenotypes in sample set
#2 (e.g. females), with $N_1$ and $N_2$ being the sample sizes of the two
sets. $\mathbf{A}_1$ is an $N_1 \times N_1$ GRM for individuals in sample set #1, $\mathbf{A}_2$ is
an $N_2 \times N_2$ GRM in sample set #2 and $\mathbf{A}_{12}$ is an $N_1 \times N_2$ GRM
between the two sets of samples. $\sigma_{G1}^2$ and $\sigma_{G2}^2$ are the genetic
variances for the two traits, $\sigma_{e1}^2$ and $\sigma_{e2}^2$ are the residual variances
with the corresponding identity matrices $\mathbf{I}_1$ and $\mathbf{I}_2$, and $\sigma_{G1G2}$ is the
genetic covariance between traits. Since the two traits are
measured on different sets of samples, the residual covariance is
ignored because it is assumed that there is no covariance between
the unrelated individuals apart from that caused by genetic factors.

The genetic correlation is also estimated as $\hat r_G = \hat\sigma_{G1G2} / \sqrt{\hat\sigma_{G1}^2 \hat\sigma_{G2}^2}$;
however, the sampling variance of $\hat r_G$ is different from that
described above. Since the traits are measured in different sets of
samples, $\mathrm{cov}(\hat\sigma_{G1}^2, \hat\sigma_{G2}^2) = \mathrm{cov}(\hat\sigma_{G1}^2, \hat\sigma_{G1G2}) = \mathrm{cov}(\hat\sigma_{G2}^2, \hat\sigma_{G1G2}) = 0$.
Therefore, from a second order Taylor series approximation
[15]

$$\mathrm{var}(\hat r_G) \approx r_G^2 \left[ \frac{\mathrm{var}(\hat\sigma_{G1}^2)}{4 \sigma_{G1}^4} + \frac{\mathrm{var}(\hat\sigma_{G2}^2)}{4 \sigma_{G2}^4} + \frac{\mathrm{var}(\hat\sigma_{G1G2})}{\sigma_{G1G2}^2} \right] \tag{8}$$

This approximation involves the sampling variance of $\hat\sigma_{G1G2}$. We
show below an equivalent approach to obtain $\mathrm{var}(\hat\sigma_{G1G2})$.

Analogous to the univariate analysis, estimation of genetic
covariance by a bivariate mixed linear model analysis is
asymptotically equivalent to the following linear regression model

$$z_{12(ij)} = \mu + b A_{12(ij)} + \varepsilon_{ij} \tag{9}$$

where $z_{12(ij)} = y_{1(i)} y_{2(j)}$, i.e. the product of phenotypes between the
$i$-th individual in set #1 and the $j$-th individual in set #2, and
$A_{12(ij)}$ is the $ij$-th element of the GRM $\mathbf{A}_{12}$, i.e. the genetic
relationship between the $i$-th individual in set #1 and the $j$-th
individual in sample set #2. The regression coefficient is
equivalent to genetic covariance between the two traits because

$$b = \frac{\mathrm{cov}(A_{12(ij)}, y_{1(i)} y_{2(j)})}{\mathrm{var}(A_{12(ij)})} = \frac{E(A_{12(ij)} g_{1(i)} g_{2(j)})}{\mathrm{var}(A_{12(ij)})} = \frac{\sigma_{G1G2} E(A_{12(ij)}^2)}{\mathrm{var}(A_{12(ij)})} = \sigma_{G1G2}$$

If the two sample sets are independent and phenotypes for both
traits have been standardized with mean of 0 and variance of 1,
then $E(z_{12(ij)}) = 0$ and $\mathrm{var}(z_{12(ij)}) = 1$. Since $\mathrm{var}(A_{12(ij)})$ is small,
$\mathrm{var}(\varepsilon_{ij}) \approx \mathrm{var}(z_{12(ij)}) = 1$. We then have
$\mathrm{var}(\hat\sigma_{G1G2}) \approx \mathrm{var}(\varepsilon_{ij}) / [N_1 N_2\,\mathrm{var}(A_{12(ij)})] \approx 1 / [N_1 N_2\,\mathrm{var}(A_{12(ij)})]$.

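We know from the derivations above that $\mathrm{var}(\hat\sigma_{G1}^2) \approx 2 / [N_1^2\,\mathrm{var}(A_{1(ij)})]$
and $\mathrm{var}(\hat\sigma_{G2}^2) \approx 2 / [N_2^2\,\mathrm{var}(A_{2(ij)})]$. For unrelated
individuals sampled from the same population,
$\mathrm{var}(A_{1(ij)}) = \mathrm{var}(A_{2(ij)}) = \mathrm{var}(A_{12(ij)}) = \mathrm{var}(A_{ij})$, we therefore get

$$\mathrm{var}(\hat r_G) \approx \frac{r_G^2 (N_1^2 h_{G1}^4 + N_2^2 h_{G2}^4) + 2 h_{G1}^2 h_{G2}^2 N_1 N_2}{2 h_{G1}^4 h_{G2}^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})} \tag{10}$$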
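The step from equation (8) to equation (10) can be written out explicitly. The working below is a sketch under two stated assumptions: the phenotypes are standardized, so that $\sigma_{G1}^2 = h_{G1}^2$, $\sigma_{G2}^2 = h_{G2}^2$ and $\sigma_{G1G2} = r_G h_{G1} h_{G2}$, and equation (8) has the usual delta-method form with weights $1/(4\sigma_{G1}^4)$, $1/(4\sigma_{G2}^4)$ and $1/\sigma_{G1G2}^2$ on the three sampling variances.

```latex
\begin{align*}
\mathrm{var}(\hat r_G)
 &\approx r_G^2\left[
     \frac{1}{4 h_{G1}^4}\cdot\frac{2}{N_1^2\,\mathrm{var}(A_{ij})}
   + \frac{1}{4 h_{G2}^4}\cdot\frac{2}{N_2^2\,\mathrm{var}(A_{ij})}
   + \frac{1}{r_G^2 h_{G1}^2 h_{G2}^2}\cdot\frac{1}{N_1 N_2\,\mathrm{var}(A_{ij})}
   \right] \\
 &= \frac{r_G^2}{2 N_1^2 h_{G1}^4\,\mathrm{var}(A_{ij})}
   + \frac{r_G^2}{2 N_2^2 h_{G2}^4\,\mathrm{var}(A_{ij})}
   + \frac{1}{N_1 N_2 h_{G1}^2 h_{G2}^2\,\mathrm{var}(A_{ij})} \\
 &= \frac{r_G^2\left(N_1^2 h_{G1}^4 + N_2^2 h_{G2}^4\right) + 2 h_{G1}^2 h_{G2}^2 N_1 N_2}
         {2 h_{G1}^4 h_{G2}^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})}
\end{align*}
```

The last line follows by putting the three terms over the common denominator $2 h_{G1}^4 h_{G2}^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})$.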



This was also tested by simulations (Text S1) and the approximated
standard errors were highly consistent with those observed
from simulations, especially when sample size was large (Table 1).
When $r_G = r_P = 0$, i.e. two traits are completely independent,
$\mathrm{var}(\hat r_G) \approx 1 / [h_{G1}^2 h_{G2}^2 N^2\,\mathrm{var}(A_{ij})]$ for traits measured on the same
sample, and $\mathrm{var}(\hat r_G) \approx 1 / [h_{G1}^2 h_{G2}^2 N_1 N_2\,\mathrm{var}(A_{ij})]$ for traits measured
on different samples. Therefore, for independent traits, the
ratio of sampling variance of genetic correlation between the two
traits measured on the same sample to that on different samples is
simply $N_1 N_2 / N^2$.
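For the different-samples design, the sampling variance can be computed from the two sample sizes, the two SNP heritabilities and the genetic correlation. The sketch below assumes the form of equation (10) in which the numerator is $r_G^2 (N_1^2 h_{G1}^4 + N_2^2 h_{G2}^4) + 2 h_{G1}^2 h_{G2}^2 N_1 N_2$ and the denominator is $2 h_{G1}^4 h_{G2}^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})$; the function name `var_rg_different_samples` is hypothetical.

```python
def var_rg_different_samples(n1, n2, h2_1, h2_2, r_g, var_aij=2e-5):
    """Approximate sampling variance of the genetic correlation when the two
    traits are measured on independent samples of size n1 and n2."""
    numerator = (
        r_g ** 2 * (n1 ** 2 * h2_1 ** 2 + n2 ** 2 * h2_2 ** 2)
        + 2 * h2_1 * h2_2 * n1 * n2
    )
    denominator = 2 * h2_1 ** 2 * h2_2 ** 2 * n1 ** 2 * n2 ** 2 * var_aij
    return numerator / denominator


# With r_G = 0 the expression collapses to 1 / (h2_1 * h2_2 * N1 * N2 * var(A_ij)).
print(var_rg_different_samples(2000, 3000, 0.5, 0.4, 0.0))
```

Note that `h2_1 ** 2` is the fourth power of $h_{G1}$, because the arguments are heritabilities ($h^2$) rather than their square roots.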

Case-control studies 

For case-control studies, the proportion of variance in case-control
status (0 or 1) that is explained by all SNPs on the observed
scale ($h_O^2$) can be estimated using a linear model [4]. Therefore, the
same approximations to the sampling variance of genetic variance
and genetic correlation for quantitative traits can be applied
directly to case-control studies. As shown in equation (5), in a
univariate analysis, the sampling variance of SNP-based heritability
depends only on sample size and variance in genetic
relatedness, independent of the properties of the phenotype, so
that $\mathrm{var}(\hat h_O^2)$ is also approximately $2 / [N^2\,\mathrm{var}(A_{ij})]$ in a case-control
study with $N$ being the total number of cases and controls. We
show in Table 2 that the observed standard errors of the estimates
of $h_O^2$ from published studies are highly consistent with those
predicted from our approximation theory.

To calculate power, however, we would need to specify $h_O^2$,
which is a parameter with non-intuitive properties, and depends
on the prevalence of the disease in the population ($K$), the
proportion of variance in disease liability that is captured by the
SNPs at population level, and the proportion of cases in the
sample ($v$). For this reason we define $h_L^2$ as the variance explained
by all SNPs at the population level on the unobserved underlying
scale of disease liability, and use a linear transformation to
transform $\hat h_O^2$ to $\hat h_L^2$ on the liability scale [4], i.e. $\hat h_L^2 = c \hat h_O^2$ and
$\mathrm{var}(\hat h_L^2) = c^2\,\mathrm{var}(\hat h_O^2)$ with $c = (1 - K)^2 / [v (1 - v) i^2]$. We then get

$$\mathrm{var}(\hat h_L^2) = \frac{2 (1 - K)^4 (N_{case} + N_{control})^2}{i^4 N_{case}^2 N_{control}^2\,\mathrm{var}(A_{ij})} \tag{11}$$



where $N_{case}$ is the number of cases, $N_{control}$ is the number of
controls, and $i$ is the selection intensity which is a function of $K$ [4].
We illustrate in Figure 2 the dependency of the SE of $\hat h_L^2$ on disease
prevalence ($K$) and proportion of cases in the sample ($v$) due to the
transformation.

Table 2. Standard errors of the estimates of variance explained by all SNPs on the observed scale ($h_O^2$) from published analyses of
case-control studies for a number of diseases vs. those predicted from the approximation theory.

Disease                   N_cases   N_controls   Prevalence   Estimate of h_O^2   SE(Obs.)   SE(Approx.)
Multiple sclerosis [34]   1604      1953         0.001        0.851               0.088      0.089
Alzheimer's disease [34]  3290      3849         0.020        0.364               0.049      0.044
Endometriosis [34]        3154      6981         0.080        0.231               0.036      0.031
Schizophrenia [7]         9087      12171        0.010        0.410               0.015      0.015
Bipolar disorder [7]      6704      9031         0.010        0.441               0.021      0.020
MDD [7]                   9041      9381         0.150        0.177               0.017      0.017
ASD [7]                   3303      3428         0.010        0.310               0.046      0.047
ADHD [7]                  4163      12040        0.050        0.253               0.020      0.020

N_cases: number of cases. N_controls: number of controls. SE(Obs.): reported standard error of the estimate of $h_O^2$ from real data analysis. SE(Approx.): standard error of $\hat h_O^2$
calculated from our approximation theory. MDD: major depressive disorder. ASD: autism spectrum disorders. ADHD: attention-deficit/hyperactivity disorder.
doi:10.1371/journal.pgen.1004269.t002
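The transformation from the observed scale to the liability scale can be sketched numerically. The code below assumes $c = (1 - K)^2 / [v (1 - v) i^2]$ and takes the selection intensity as $i = z / K$, where $z$ is the standard normal density at the liability threshold for prevalence $K$; this is the standard truncated-normal result and is stated here as an assumption rather than quoted from the text. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def liability_scale_factor(k, v):
    """Factor c that converts observed-scale h2 to liability-scale h2.

    k : disease prevalence in the population (K)
    v : proportion of cases in the sample
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - k)  # liability threshold for prevalence K
    z = std_normal.pdf(threshold)          # normal density at the threshold
    i = z / k                              # selection intensity (assumed i = z / K)
    return (1 - k) ** 2 / (v * (1 - v) * i ** 2)


def se_h2_liability(n_case, n_control, k, var_aij=2e-5):
    """Approximate SE of the liability-scale SNP heritability: c * SE(observed)."""
    n = n_case + n_control
    v = n_case / n
    se_observed = math.sqrt(2.0 / (n ** 2 * var_aij))
    return liability_scale_factor(k, v) * se_observed


print(se_h2_liability(5000, 5000, 0.01))
```

Because the factor depends on both $K$ and $v$, the same total sample size can give quite different standard errors on the liability scale.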

As shown in equation (10), in a bivariate analysis where traits
are measured on different sets of samples, the sampling variance of
genetic correlation depends on sample sizes, trait heritabilities and
the genetic correlation parameter, which is also independent of the
properties of the phenotypes. Therefore, in a bivariate analysis of
two independent case-control disease studies,

$$\mathrm{var}(\hat r_G) \approx \frac{r_G^2 (N_1^2 h_{O1}^4 + N_2^2 h_{O2}^4) + 2 h_{O1}^2 h_{O2}^2 N_1 N_2}{2 h_{O1}^4 h_{O2}^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})} \tag{12}$$

where $N_1$ and $N_2$ are the total numbers of cases and controls of the
two case-control studies, respectively. This also applies to a
bivariate analysis of a quantitative trait and a case-control disease
study on different sets of samples,

$$\mathrm{var}(\hat r_G) \approx \frac{r_G^2 (N_1^2 h_O^4 + N_2^2 h_G^4) + 2 h_O^2 h_G^2 N_1 N_2}{2 h_O^4 h_G^4 N_1^2 N_2^2\,\mathrm{var}(A_{ij})} \tag{13}$$

with $N_1$ and $h_O^2$ referring to the case-control study and $N_2$ and $h_G^2$
to the quantitative trait. These two equations can also be expressed with respect to $h_L^2$,
given $h_O^2 = h_L^2 / c$ (see above). We show in Table 3 that the reported
SEs of $\hat r_G$ from bivariate analyses of psychiatric diseases are also
highly in line with the predicted SEs from the approximation
theory.



Statistical power 

Statistical power is calculated from the population value of the
parameter and its sampling variance, which was derived above. If
the parameter is $\theta$, where $\theta$ is either the proportion of phenotypic
variance captured by SNPs ($h_G^2$) in the univariate case or the
genetic correlation ($r_G$) in the bivariate case, the test statistic
$\hat\theta^2 / \mathrm{var}(\hat\theta)$ is asymptotically distributed as a non-central $\chi^2$ with
one degree of freedom and non-centrality parameter (NCP) of $\lambda = \theta^2 / \mathrm{var}(\hat\theta)$.
Given $\lambda$ and the type-I error rate of $\alpha$, statistical power is the
probability that a non-central $\chi^2$ variable is larger than the central
$\chi^2$ threshold that is determined by $\alpha$. We show in Figure 3 the
statistical power based on the sampling variance from our
approximation theories to detect $h_G^2$ in a univariate case and $r_G$
in a bivariate case under a range of scenarios. For example, for a
quantitative trait, approximately 8900, 4500, 3000 and 2300
independent individuals are required to detect $h_G^2$ of 0.1, 0.2, 0.3
and 0.4 with >80% power at a type-I error rate of 0.05,
respectively. For two quantitative traits measured on the same
sample, approximately 7000, 4700, 2500 and 1600 independent
individuals are required to detect $r_G$ of 0.2, 0.4, 0.6 and 0.8 with
>80% power at a type-I error rate of 0.05, respectively.

[Figure 2 image not reproduced. Curves: v = 0.2, K = 0.1; v = 0.5, K = 0.1; v = 0.2, K = 0.001; v = 0.5, K = 0.001. X-axis: Total number of cases and controls (2000 to 10000). Y-axis: SE (0 to 0.6).]

Figure 2. Standard error (SE) of the estimate of variance explained by all SNPs on the underlying scale ($h_L^2$) from a univariate
analysis of a case-control study vs. total number of cases and controls (sample size). The SE is predicted from the approximation theory
given different levels of disease prevalence (K) and proportion of cases in the sample (v).
doi:10.1371/journal.pgen.1004269.g002
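The power calculation described in this section can be sketched with the Python standard library alone. The example assumes a one-degree-of-freedom test whose non-centrality parameter is the squared parameter value divided by its sampling variance, and uses the fact that a 1-df non-central $\chi^2$ variable is the square of a normal variable with mean $\sqrt{\lambda}$ and unit variance. The function names are hypothetical, and this is not the code of the online calculator.

```python
import math
from statistics import NormalDist


def se_h2_snp(n, var_aij=2e-5):
    """Approximate SE of the SNP-heritability estimate (equation 5)."""
    return math.sqrt(2.0 / (n ** 2 * var_aij))


def power_wald_1df(theta, se, alpha=0.05):
    """Power of a 1-df chi-squared test with NCP = (theta / se)^2."""
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1 - alpha / 2)  # square root of the central chi-squared threshold
    root_ncp = abs(theta) / se
    # P[(Z + root_ncp)^2 > crit^2] for Z ~ N(0, 1), summing both tails.
    return (1 - std_normal.cdf(crit - root_ncp)) + std_normal.cdf(-crit - root_ncp)


for h2, n in [(0.1, 8900), (0.2, 4500), (0.3, 3000), (0.4, 2300)]:
    print(h2, n, round(power_wald_1df(h2, se_h2_snp(n)), 3))
```

Each of the four sample sizes quoted in the text for a quantitative trait gives a power above 0.8 under these assumptions.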

Online tool 

We have also developed an online calculator (GCTA Power 
Calculator, http://spark.rstudio.com/ctgg/gctaPower), as part of 
the GCTA [13] software package (http://ctgg.qbi.uq.edu.au/ 
software/gcta), using R-Shiny (http://shiny.rstudio.org) to calculate
the SE of genetic variance or genetic correlation and statistical
power given user-defined parameters. 

Discussion 

We have derived the approximate sampling variance of the
estimate of variance explained by all common SNPs ($h_G^2$) for a
quantitative trait or case-control study of a disease, and genetic
correlation ($r_G$) between two quantitative traits, between two
diseases, or between a quantitative trait and a disease, using
genome-wide SNP data in unrelated individuals. We believe that
the derivations and the online tool will be helpful for researchers to
determine how many samples are required to detect $h_G^2$ (or $r_G$) and
to estimate $h_G^2$ (or $r_G$) with adequate precision before collecting the
genotype data.

The sampling variance of $\hat h_G^2$ for a complex trait is inversely
proportional to the square of sample size ($N$) and to the variance in SNP-based
genetic relatedness ($\mathrm{var}(A_{ij})$), and independent of $h_G^2$. The
sampling variance of $\hat r_G$ between two complex traits is a function
of $r_G$, $N$ of the two samples, $h_G^2$ of the two traits and $\mathrm{var}(A_{ij})$ when
the traits are measured on different samples, and further depends
on the phenotypic correlation ($r_P$) when traits are measured on the
same samples. All the approximation theories apply to case-control
studies of diseases since the case-control data can be analysed
using a linear model on the observed 0-1 scale. The sampling
variance for the estimate on the observed scale can then be
transformed to that on the underlying liability scale using well-established
theory. The standard errors (square root of sampling
variance) of either $\hat h_G^2$ or $\hat r_G$ observed in published studies were all
highly consistent with those predicted from our approximation
theories, which were also confirmed by simulations based on real
genotype data.

Analytical expressions for the sampling variance of the estimates 
of genetic (co)variance from pedigree analyses have been around for 
over 50 years [17,26], and statistical power can be derived from 
these by using the sampling variance and population value of the 
parameter. However, these expressions are typically for specific 
structured pedigrees, such as fullsib or halfsib families or twin pairs. 
There are to our knowledge no simple approximations for general
pedigrees, because the inverse of the variance-covariance matrix
is required and this is conditional on the actual pedigree
structure. The sampling variance of the estimated parameters in
a general complex pedigree is usually derived post hoc after the
analysis has been performed.

Figure 3. Statistical power of detecting genetic variance (correlation) under different study designs. a) Univariate analysis of a
quantitative trait. b) Univariate analysis of a case-control study assuming equal number of cases and controls (v = 0.5) and heritability of liability ($h_L^2$)
of 0.2. c) Bivariate analysis of two quantitative traits measured on the same set of individuals, assuming heritability of 0.2 for both traits. d) Bivariate
analysis of two case-control studies on independent sets of samples, assuming equal numbers of cases and controls for each disease, and equal
sample size (total number of cases and controls), equal heritability of liability ($h_L^2$ = 0.2) and equal prevalence (K = 0.01) for both diseases.
doi:10.1371/journal.pgen.1004269.g003

Methods for calculating the power of detecting quantitative trait
loci (QTL) in family-based linkage studies have been investigated
extensively in the past two decades [16-18,27]. These methods
were developed to calculate the power of detecting a QTL but can
be generalized for variance components estimation, e.g. estimating
the genetic variance using pedigree information. The non-centrality
parameter of the test-statistic from a maximum likelihood analysis of
variance components is $NCP = E(2 \ln L_1) - E(2 \ln L_0) = -E(\ln|\mathbf{V}_1|) + \ln|\mathbf{V}_0|$,
where $L$ is the likelihood function, and $\mathbf{V}_0$ and
$\mathbf{V}_1$ are the variance covariance matrices under the null and
alternative hypotheses respectively [17,18]. For a specific balanced
pedigree design, e.g. fullsibs or nuclear families, the determinant (or
inverse) of the $\mathbf{V}$ matrix can be computed explicitly, so that the NCP
can be calculated without making approximation [16,17]. For an
arbitrary pedigree, $\ln|\mathbf{V}|$ can be calculated approximately using
Taylor expansions given the variance in family relatedness [18,27].
Therefore, all these methods explicitly or implicitly require a known
pedigree. When the correlations between relatives are small, the first
order approximation of the NCP in Rijsdijk et al. [18] can be written
in our notations as $NCP \approx \sum_{i>j} \mathrm{var}(A_{ij}) h_G^4 \approx N^2 h_G^4\,\mathrm{var}(A_{ij}) / 2$,
which is the same as we derived (i.e. $h_G^4 / \mathrm{var}(\hat h_G^2)$, see Equation
(4) for $\mathrm{var}(\hat h_G^2)$), even though our derivations were based on least
squares regression analysis in unrelated samples whereas the
derivations in Rijsdijk et al. [18] were based on maximum likelihood
approach in family data. This approximation is reasonably accurate
when correlations between relatives are small for a pedigree-based
design, which is not an issue for a population-based design where
the genetic relationships between unrelated samples are very small
as demonstrated in Yang et al. [3]. We show by simulations (Text S1)
that for a univariate analysis the LRT statistics calculated based on
REML are highly consistent with the chi-squared test-statistics
calculated by the Wald test using the sampling variance either
observed from the simulations or predicted from our approximation
theory (Figure S1).

For a given population, a set of common SNPs and the method 
of calculating the genetic relationship matrix that we have used 
here, $\mathrm{var}(A_{ij})$ is a fixed quantity because it depends only on
effective population size of the human populations [28]. We used
$\mathrm{var}(A_{ij}) = 2 \times 10^{-5}$, which was calculated from theory based upon
an effective population size of 10,000. Variance in genetic 
relatedness (and therefore power of detection) can decrease by 
including many rare SNPs in calculating the GRM because adding 
more rare SNPs increases the effective population size reflecting 
recent population expansion. The variance in relatedness can also 
increase by sampling closer relatives (see below for more 
discussion) or, for example, by creating a relationship matrix 
based upon haplotype information. Modifying the GRM can also 
affect the variance of the off-diagonal elements. For example by 
applying a weighting of SNPs depending on linkage disequilibrium 
the variance in the estimates of genetic relationships will decrease 
so that the sampling variance of the estimate of SNP-based 
heritability will be increased [29]. Although we derive the theory 
and show the results based on the SNPs on the whole genome, our 
approximation theories are also applicable in analyses using a 
subset of SNPs, e.g. SNPs from a single chromosome. In that case, 
$\mathrm{var}(A_{ij})$ used in the approximation equations should be either
observed empirically from data or derived from theory [28] based 
on the subset of SNPs. 
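Obtaining the variance of the off-diagonal relationships empirically is straightforward once a GRM is available. The sketch below assumes the GRM is held as a square nested list and uses a toy 3 x 3 matrix purely for illustration; the function names are hypothetical, and reading a GRM from GCTA output files is not shown.

```python
import math


def var_offdiag(grm):
    """Variance of the off-diagonal GRM elements A_ij (i > j)."""
    n = len(grm)
    values = [grm[i][j] for i in range(n) for j in range(i)]
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def se_h2_from_grm(grm):
    """Approximate SE of SNP heritability using the empirical var(A_ij)
    and the N^2 / 2 approximation for the number of pairs."""
    n = len(grm)
    return math.sqrt(2.0 / (n ** 2 * var_offdiag(grm)))


toy_grm = [[1.0, 0.01, -0.01], [0.01, 1.0, 0.0], [-0.01, 0.0, 1.0]]  # illustrative values only
print(var_offdiag(toy_grm))
```

For a GRM built from a single chromosome the empirical variance will generally be larger than the genome-wide value, which is why the text recommends recomputing it for any SNP subset.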

If there are unknown related samples in the data (cryptic
relatedness), $\hat h_G^2$ will possibly be inflated due to shared environment
between close relatives and/or the effects of causal variants
in LD with the SNPs but captured by family relatedness, and
$\mathrm{var}(\hat h_G^2)$ will be deflated due to the increase of $\mathrm{var}(A_{ij})$. In fact, the
interpretation of $h_G^2$ changes if there is a substantial proportion of
close relatives in the data [30,31]. This, however, affects GWAS
results in a similar way, where the SE of the estimate of a SNP
effect from a single SNP analysis (e.g. linear regression for a
quantitative trait and logistic regression for a case-control study)
will be deflated, causing an inflation of the test statistics in GWAS
(often called "genomic inflation" [32]). For the estimation of $h_G^2$
using all common SNPs, to avoid possible confounding from
shared environments and uncaptured causal variants, we suggested
in Yang et al. (2010) a stringent threshold, i.e. 0.025, to remove
cryptic relatedness from the data so that the estimate of $h_G^2$ can be
compared directly to the results from GWAS in response to the
"missing heritability" problem [2]. In practice, observing a much
smaller SE of $\hat h_G^2$ using all common SNPs than that predicted from
theory is a caveat suggesting substantial cryptic relatedness
remaining in the data.
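One simple way to apply a relatedness threshold such as 0.025 is to drop individuals until no retained pair exceeds it. The sketch below uses a greedy rule that repeatedly removes the individual involved in the most over-threshold pairs; this rule is an assumption chosen for illustration and is not claimed to be the procedure implemented in GCTA. The function name `prune_related` is hypothetical.

```python
def prune_related(grm, cutoff=0.025):
    """Return sorted indices of individuals kept so that no retained pair
    has an off-diagonal relationship above `cutoff`."""
    keep = set(range(len(grm)))
    while True:
        # Count, for each retained individual, how many retained partners exceed the cutoff.
        counts = {
            i: sum(1 for j in keep if j != i and grm[i][j] > cutoff)
            for i in sorted(keep)
        }
        worst = max(counts, key=counts.get, default=None)
        if worst is None or counts[worst] == 0:
            return sorted(keep)
        keep.remove(worst)  # ties are broken in favour of removing the lowest index


toy_grm = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.3, 0.0],
    [0.0, 0.3, 1.0, 0.01],
    [0.0, 0.0, 0.01, 1.0],
]
print(prune_related(toy_grm))  # [0, 2, 3]
```

After pruning, the empirical variance of the remaining off-diagonal elements can be compared with the theoretical value to check whether substantial relatedness remains.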

Using the same experimental design of a sample of conventionally
unrelated individuals, the experimenter can increase
power by increasing sample size. Fortunately, power increases
quadratically with sample size because every new sample is 
contrasted with all existing samples. The sampling variance of the 
estimate of the genetic correlation is generally much larger than 
that of the proportion of variance explained from a univariate 
analysis, consistent with the theory of the sampling variance of 
genetic correlations in pedigree designs [33]. 
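The quadratic dependence on sample size is easiest to see through the non-centrality parameter, which for a univariate analysis is $h_G^4 N^2\,\mathrm{var}(A_{ij}) / 2$ under the approximation theory. The sketch below computes this quantity and inverts it to give an approximate required sample size; it assumes a one-degree-of-freedom test and ignores the negligible contribution of the opposite tail. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def ncp_univariate(h2, n, var_aij=2e-5):
    """Non-centrality parameter h2^2 * N^2 * var(A_ij) / 2."""
    return h2 ** 2 * n ** 2 * var_aij / 2.0


def required_n_univariate(h2, power=0.8, alpha=0.05, var_aij=2e-5):
    """Approximate smallest N giving the requested power to detect h2."""
    std_normal = NormalDist()
    root_ncp = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    # SE(h2) = sqrt(2 / var_aij) / N, so solving h2 / SE = root_ncp gives N.
    return math.ceil(root_ncp * math.sqrt(2.0 / var_aij) / h2)


print(ncp_univariate(0.2, 4000) / ncp_univariate(0.2, 2000))  # doubling N quadruples the NCP
print(required_n_univariate(0.1))
```

Under these assumptions the required sample size scales as $1 / h_G^2$, so halving the SNP heritability doubles the number of individuals needed.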
Figure S1 Likelihood ratio test (LRT) statistic vs. chi-squared
test-statistic in a univariate analysis.
Table S1 Standard error of the estimate of $h_G^2$ (variance [...]
calculated from our approximation theory.